March 25, 2025
\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]
This is the transformer
A self-attending encoder
A mostly self-attending decoder
Replace recurrence with positional encodings and multi-headed self-attention
A few additional bells and whistles to make everything tick
Just as good as SoTA recurrent models
Way lower training cost (orders of magnitude lower) due to parallel construction
The full encoder-decoder transformer is most often used for machine translation tasks
A sequential input goes in
A sequential output is returned
Examples:
English to Spanish
Image to Caption
An example implementation in Keras can be found here.
Certain types of tasks never require predicting one word ahead - we see the entire sequence at once and want to do something with it while accounting for the sequential nature of the input!
Text classification
Image classification (?)
Part of speech learning
Next Sentence Prediction
The goal is to use the encoder from the transformer architecture to learn about the structure of the input in a clever parallelizable way
Treat the language problem as a representation problem
If we fully understand the context of the input, then we could fill in any gaps and adequately describe the sentence in a contextual way
If we fully understand the context of a sentence, then we could find another sentence that is likely to come after it!
The strategy is to use the encoder of a transformer architecture to learn how to fill in the gaps for language!
Example:
A man wanted milk with his cookies. He went to the store to buy a gallon of milk.
Generation Context Success
Prompt :
Finish the following story: A man wanted milk with his cookies. He went to the store to buy a…
Output 1:
P(Gallon | Prompt) \(\approx\) 1
Output 2:
P(of Milk | Prompt + Gallon) \(\approx\) 1
In GPT style “1 token ahead” prediction, we’ve succeeded if we’re able to finish the story correctly.
Representational Context Success
Input:
<CLS> A man wanted milk with his <MASK1> <SEP> He went to the <MASK2> to buy a <MASK3> of milk <SEP>
Success is appropriately filling in the masks with the correct words
P(MASK1 = Cookies | \(\mathbf x\)) \(\approx\) 1
P(MASK2 = Store | \(\mathbf x\)) \(\approx\) 1
P(MASK3 = Gallon | \(\mathbf x\)) \(\approx\) 1
If we truly understand the context of a sentence, then we should be able to fill in the masks appropriately!
Training a ground truth model with masked language modeling:
Start with a large corpus of text.
Mask out 15% of the input words, then build a model to predict those masked words
The idea is that context is all about filling in the gaps!
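The masking step above can be sketched in a few lines. This is a simplified version: the 15% rate is from the text, but real BERT also sometimes swaps in a random or unchanged token instead of <MASK>, which this sketch skips.

```python
import random

random.seed(0)
MASK = "<MASK>"

def mask_tokens(tokens, frac=0.15):
    """Replace a random 15% of tokens with <MASK>; return (masked, targets)."""
    n_mask = max(1, int(len(tokens) * frac))
    idx = set(random.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in idx else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in idx}  # the words the model must predict
    return masked, targets

sent = "a man wanted milk with his cookies".split()
masked, targets = mask_tokens(sent)
```

The model is then trained to recover `targets` from `masked`, which is exactly the "fill in the gaps" objective.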
This idea led to Bidirectional Encoder Representations from Transformers (BERT)
Compared to previous seq2seq models that only looked forward, use self-attention to learn context
Self-attention looks backwards and forwards!
Restrict context to what we see and let looking one sentence/word ahead be one of the many tasks that the contextual representation can address
The novel idea was to make BERT a pre-processing step rather than a solver
BERT to Text Classification
BERT to Next Sentence Prediction
BERT to Word Level Classification
Create a full pre-trained transformer that is good at one thing, but can be easily fine-tuned for other tasks
GPT only does one task - next word prediction!
Different tasks can be given as part of the prompt
BERT is all about mastery of certain small tasks
Input:
<CLS> A man wanted milk with his <MASK1> <SEP> He went to the <MASK2> to buy a <MASK3> of milk <SEP>
Two special tokens for BERT:
<CLS>: The class token. Used to encode one or more classes for an input sentence.
<SEP>: Equivalent to <eos>. Separates two sentences that might be contiguous.
The base task for the BERT pre-trained encoder is next sentence prediction.
Take a pair of sentences from a corpus:
If they are sequential, set the label to 1
If not, set to 0
Examples
Class = 1: <CLS> the man went to <MASK> store <SEP> he bought a <MASK> of milk <SEP>
Class = 0: <CLS> the man <MASK> to the store <SEP> penguins <MASK> flight ##less birds <SEP>
The first BERT pre-training corpus:
Wikipedia (2.5B tokens) + BooksCorpus (0.8B tokens)
Limit vocabulary size to 30k common sub-word units
Create 3 embedding types for each input token: WordPiece token embeddings, segment embeddings, and positional embeddings
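A minimal numpy sketch of how the three embedding types combine into one input vector per token. The table sizes match BERT-base; the token ids and randomly initialized tables are purely illustrative stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_segments, max_len, d = 30_000, 2, 512, 768  # BERT-base sizes

# Three learned lookup tables; randomly initialized here for illustration.
tok_emb = rng.normal(size=(vocab_size, d))
seg_emb = rng.normal(size=(n_segments, d))
pos_emb = rng.normal(size=(max_len, d))

token_ids   = np.array([101, 1037, 2158, 102])  # made-up WordPiece ids
segment_ids = np.array([0, 0, 0, 0])            # all from sentence A
positions   = np.arange(len(token_ids))

# The input to the first self-attention block is the elementwise sum.
x = tok_emb[token_ids] + seg_emb[segment_ids] + pos_emb[positions]
print(x.shape)  # (4, 768)
```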
Two BERT models:
BERT-base: 12 self-attention blocks (self-attention + FCNN w/ ReLU), 768 dimensions for the embeddings and self-attention states throughout, 12 attention heads per self-attention compute, 110M parameters
BERT-large: 24 layers, 1024 hidden size, 16 attention heads, 340M parameters
For both:
Max sequence size of 512 (512 tokens per sentence pair, shorter length sentences padded to 512)
Learnable parameters: Attention Weight Matrices, FCNN weight matrices
Attention weights computed from learned weight matrices and scaled dot products with the other input tokens
At the end:
Weight matrices (which can be frozen and transferred to a different task)
Token-specific hidden representations of size 768/1024 that contextualize each word against all other input words
Theoretically, a hidden space that vectorizes the human language!
Also:
A contextual next sentence classifier associated with the <CLS> token
A set of values that, when softmaxed, correspond to class probabilities!
A view of the flow with \(D\) tokens:
\(D\) \(M\)-dimensional vectors at the start
After each self-attention block, \(D\) \(M\)-dimensional vectors that are convex combinations of all of the input vectors weighted by dot product similarity
The final layer is a latent representation of the sentence in a very high-dimensional language space
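The convex-combination claim is easy to verify with a single-head, numpy-only sketch of scaled dot-product self-attention. All sizes and weights are illustrative; note the combination is technically over the value projections of the inputs.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M = 5, 8                               # D tokens, M-dimensional vectors
X = rng.normal(size=(D, M))
Wq, Wk, Wv = (rng.normal(size=(M, M)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(M)             # scaled dot-product similarity
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
out = weights @ V                         # D new M-dimensional vectors

# Each output row is a convex combination of the (projected) input vectors:
assert np.allclose(weights.sum(axis=1), 1.0) and (weights >= 0).all()
```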
The <CLS> token, then, tells us how the class is determined by all of the other tokens
And all of the combinations of tokens
And all of the combinations of combinations of tokens
And on and on and on
Specifically, \(\mathbf c\), the final representation for the <CLS> token, is an \(M\)-dimensional vector that says how the class relates to the other tokens in the input!
Add a dense layer with a softmax activation that maps the values of \(\mathbf c\) to the class probabilities!!
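The classification head is then just one matrix multiply and a softmax over \(\mathbf c\). A numpy sketch with illustrative random weights:

```python
import numpy as np

rng = np.random.default_rng(2)
M, n_classes = 768, 2                 # hidden size; e.g. next-sentence yes/no
c = rng.normal(size=M)                # final hidden state of the <CLS> token
W, b = rng.normal(size=(M, n_classes)), np.zeros(n_classes)

logits = c @ W + b
probs = np.exp(logits) / np.exp(logits).sum()   # softmax -> class probabilities
```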
Separate the classification part from the encoder
Sounds familiar…
The big idea:
For CNNs, there is a certain point where the convolutional encoder with enough inputs describes all pictures we could see
For text, there is a certain point where the BERT self-attention layers with enough inputs describe the English language! Regardless of the classification task.
For a different classification problem:
Chop off the classification head of the model
Add a new class-labeled corpus (say positive or negative sentiment sentences)
Update the BERT weights for just a few epochs to adjust from the general problem to your specific problem
Get a state of the art sentence classifier!
This is a little expensive for the non-billion dollar company
Some tricks of the trade:
Use smaller BERT models - we don’t need the true SoTA for most applications.
Just update weights in the top layers of the model. If there are 12 self-attention layers, just update the top 2 layers and the classification weights.
Take advantage of HPC resources.
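The "update only the top layers" trick boils down to skipping gradient updates for frozen weights. A framework-free toy sketch (the block names, sizes, and gradients are all made up):

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy stand-in for a 12-block encoder: one weight matrix per block + a head.
params = {f"block_{i}": rng.normal(size=(8, 8)) for i in range(12)}
params["cls_head"] = rng.normal(size=(8, 2))

# The trick: freeze everything except the top 2 blocks and the class head.
trainable = {"block_10", "block_11", "cls_head"}

def sgd_step(params, grads, lr=1e-3):
    for name in params:
        if name in trainable:          # frozen weights are simply skipped
            params[name] = params[name] - lr * grads[name]
    return params

grads = {k: np.ones_like(v) for k, v in params.items()}
before = {k: v.copy() for k, v in params.items()}
params = sgd_step(params, grads)
changed = {k for k in params if not np.allclose(params[k], before[k])}
# changed == {"block_10", "block_11", "cls_head"}
```

In Keras or Hugging Face this is the same idea expressed as setting `trainable = False` on the lower layers.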
Let’s look at some example code!
Note: The full model topped out at roughly 17GB of VRAM. That can be done with a beefy GPU, but use smaller models or more clever pre-training to run on lesser resources
Transformers are data hungry
Modern ML is all about big data situations!
This is why transformers are all the rage right now
The entire English language? BIG
All images on Google? BIG
Sort of limits their usage for regular people like us…
The BERT steps:
Train an encoder and then classify
That sounds a lot like a CNN!
Can we use a self-attention based encoder for image problems?
Why might we want to do this?
Convolutions are computationally expensive
There are a lot of weights that you need to learn for a CNN
Roughly 25 million parameters for a typical model like ResNet-50
Lots of add/multiply operations - a 3 x 3 filter with same padding on a 256 x 256 image requires roughly 1.2 million add/multiply operations to get the next feature map. Multiply that by 32 (number of different filters) and we’re computing roughly 40 million add/multiply operations per image in the training step for one convolutional layer!!!
For modern hardware, this takes fractions of a millisecond, but it adds up!
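The op counts above can be checked directly:

```python
# Back-of-the-envelope op count for one convolutional layer, using the
# numbers from the text above.
H = W = 256          # image size; "same" padding keeps the output 256 x 256
k = 3                # 3 x 3 filter
filters = 32

ops_per_filter = H * W * k * k * 2      # one multiply + one add per filter tap
total_ops = ops_per_filter * filters

print(f"{ops_per_filter:,}")            # 1,179,648 -> roughly 1.2 million
print(f"{total_ops:,}")                 # 37,748,736 -> roughly 40 million
```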
Why might we want to do this?
CNNs look in increasingly large neighborhoods for patterns
Start small and work out to generic parts
What if the image context isn’t really contiguous?
Self-attention across pixels could allow us to see patterns faster!
Spoiler: Vision Transformers are quite new and aren’t on par with CNNs yet for image classification
But, they are faster when the training sets are millions or billions of images!
Good to know. But, no need to use them unless you have a specific reason to do so.
Vision transformers are mostly convolution free image classifiers
Each pixel is a vector of length 3
Feed as a sequence to an encoder based transformer
This isn’t going to be viable.
Any thoughts as to why?
Think about a “small” image of size 256 x 256
Both BERT models had a max sequence length of 512
An image with 256 x 256 pixels turns into a sequence of 65,536 tokens!
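Counting it out makes the problem obvious, and hints at the patched-encoder fix mentioned later (the 16 x 16 patch size here is illustrative):

```python
H = W = 256
pixel_tokens = H * W                 # one token per pixel
print(pixel_tokens)                  # 65536 -- far beyond BERT's 512 cap

# Self-attention cost grows with the square of the sequence length:
print(pixel_tokens ** 2)             # ~4.3 billion attention scores per layer

# The fix: treat small patches, not pixels, as the tokens.
patch = 16
patch_tokens = (H // patch) * (W // patch)
print(patch_tokens)                  # 256 -- a manageable sequence
```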
The breakthrough happened in 2021
Vision transformers are hot in the ML community right now.
Big improvement:
Combine the windowed approach of CNNs with Transformers via SWIN
SoTA performance on large classification tasks with less compute time!
One of the bigger areas of research is using Vision Transformers for Object Detection
Complex models with Encoder/Decoder Architecture
Simple models with just patched encoder!
We’ll dig into generative models more over the final part of this class.
Just a quick overview/solution today
Recall that all machine learning is about learning a probability distribution
Most commonly supervised learning
Under some light assumptions:
\[ f(\mathbf x) = P(y | \mathbf x) \]
There’s also unsupervised learning
Different goal: Learn structure within the input under set of assumptions
Given a set of observations, \(\mathbf x_i \in \mathbf X\), learn about the structure of the inputs
\[ f(\mathbf x) = P(\mathbf x) \]
What are some methods you’ve already seen (this class and others) that perform unsupervised learning? What do they learn?
These methods are designed to learn structure in an input corpus
Clustering - find groups of inputs that are similar
PCA - project high dimensional data to a lower dimensional subspace that retains information
Density estimation - learn the structure of the data in terms of likelihood
Another unsupervised task
Given a set of books by Jane Austen, generate a new book that reads like it was written by her!
The thought:
If we can describe the structure of what makes these books sound like Jane Austen, then we could sample from the structure to get a new book!
What about images? Suppose we have a collection of animal images. How can we describe the relationship between pixels to know if it is an animal or not? How could we plausibly generate a new picture not included in the training set that plausibly looks like an animal?
What makes an animal an animal?
Legs
Fur?
Mouth?
Hard to say! But, it all just boils down to how they look
If we learn the hidden code, then we could generate new plausible images of animals by drawing a new sample from \(P(\mathbf X)\)
Conditional Generation
Maybe we don’t just want an animal. We want a new image of a cat or dog!
Find the parts of \(P(\mathbf X)\) that correspond to our label
A conditional density
\[ P(\mathbf x | \mathbf y = \text{Cat}) \]
How can we get this conditional density using things we’ve already seen?
Good ol’ Bayes rule:
\[ P(\mathbf x | \mathbf y = \text{Cat}) = \frac{P(\mathbf y = \text{Cat} | \mathbf x) P(\mathbf x)}{P(\mathbf y = \text{Cat})} \]
The first term is our discriminator - we know how to get this!
The second term is the density over all animal images
The third term is the proportion of the distribution of all animal images that belongs to cats
Find the joint density of the labels and the images!
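A toy numeric check of the Bayes-rule decomposition above; all three probabilities are made up for illustration:

```python
# Plugging made-up numbers into P(x | Cat) = P(Cat | x) P(x) / P(Cat):
p_cat_given_x = 0.90    # discriminator output, P(y = Cat | x)
p_x = 0.0002            # density of animal images at x, P(x)
p_cat = 0.30            # fraction of the density belonging to cats, P(y = Cat)

p_x_given_cat = p_cat_given_x * p_x / p_cat
print(round(p_x_given_cat, 6))    # 0.0006
```

The discriminator and the class proportion are the easy parts; learning the density \(P(\mathbf x)\) is the hard part that generative modeling tackles.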
The goal of generative modeling is to come up with clever ways of learning the density of inputs!
Once we learn the density, we can sample from it to create new instances!
This is very hard
We’re asking a computer to understand what makes a cat a cat without telling it that a cat is a cat
The moment we tell it that the cat is a cat and condition on the image, we’ve restricted what we can learn!
Create machines that act like humans
“Simple” generative models can create images that are indistinguishable from real images for simple things
It gets a little harder for more complex tasks
Goal: Come up with a strategy to learn \(P(\mathbf x)\) given a large set of inputs - \(\mathbf X\)
A collection of images
A collection of sentences
The easiest way to approach this is with a common probability identity.
Let \(\mathbf x\) be a vector of inputs - \([x_1,x_2,....,x_P]\)
The joint density over inputs is then:
\[ P(\mathbf x) = f(x_1,x_2,x_3,...,x_P) \]
The probability chain rule:
\[ f(x_1,x_2,x_3,...,x_P) = f(x_1)f(x_2 | x_1)f(x_3 | x_1,x_2)...f(x_P | x_{P-1}, x_{P-2},...) \]
If it is possible to learn the conditional density of the next input given the previous inputs, then we’ve learned the joint density of the inputs!
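The chain rule is easy to check numerically. In practice models work in log space, where the product of conditionals becomes a sum (the conditional probabilities here are made up):

```python
import numpy as np

# Chain rule: the joint is the product of the conditional densities.
# Made-up conditional probabilities for a 4-token sequence:
cond = [0.5, 0.8, 0.9, 0.25]   # f(x1), f(x2|x1), f(x3|x1,x2), f(x4|x1,x2,x3)

joint = np.prod(cond)
log_joint = np.sum(np.log(cond))   # log space: the product becomes a sum
assert np.isclose(np.log(joint), log_joint)
print(round(float(joint), 4))      # 0.09
```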
Does this look like anything we’ve recently talked about?
Generative Pretrained Transformers are autoregressive generators
Find the weights for masked self-attention that maximize the probability that we generate the correct next word
Learns \(P(\mathbf x)\) instead of \(P(y | \mathbf x)\)!
Technically, greedy generation only approximates \(\underset{\mathbf x}{\text{argmax}} P(\mathbf x)\)
Using GPT for classification is kind of a square peg in a round hole
A lot like K-Means + Logistic Regression for classification
Semi-supervised learning
ChatGPT is trained on the entirety of the English language
Conditional language generation follows the same principle:
Let \(\mathbf y\) be the prompt token
\[ P(\mathbf x | \mathbf y) = f(x_1 | \mathbf y)f(x_2 | \mathbf y , x_1)... \]
At this point in time, this is the only real dog in the text generation fight
Easy to understand if you understand transformers
This one-ahead approach seems to work remarkably well for language generation
I have a hard time imagining the next frontier in text generation beyond improvements to the GPT training process
Can we generate images this way?
Sorta!
The first proposed method was the PixelRNN (van den Oord et al., 2016)
Assume images are generated as rectangular pixel grids
Start in the top left corner, \(x_{11}\)
Take a draw from two conditional densities to get next two pixels, \(f(x_{12} | x_{11})\) and \(f(x_{21} | x_{11})\)
Get the next 4 pixels
Then 8…
The probability model uses the LSTM architecture.
Assume each RGB vector for each pixel depends on a hidden state that is determined recurrently from its neighbors
\[ \mathbf h_{x,y} = f(\mathbf h_{x-1,y}, \mathbf h_{x,y-1}) \]
Train over a large corpus to learn all of the hidden state transitions!
Implicitly depends on all previously seen pixels!
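The recurrence \(\mathbf h_{x,y} = f(\mathbf h_{x-1,y}, \mathbf h_{x,y-1})\) can be sketched in numpy with a simple tanh cell standing in for the LSTM; the grid size, state size, and weights are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
H, Wd = 4, 4                           # tiny image grid
d = 3                                  # hidden state size (illustrative)
W_left, W_up = rng.normal(size=(d, d)), rng.normal(size=(d, d))

h = np.zeros((H, Wd, d))
for x in range(H):                     # raster order from the top-left corner
    for y in range(Wd):
        left = h[x - 1, y] if x > 0 else np.zeros(d)
        up = h[x, y - 1] if y > 0 else np.zeros(d)
        # Stand-in for the LSTM cell: each state depends on its two
        # neighbors, and hence implicitly on every earlier pixel.
        h[x, y] = np.tanh(left @ W_left + up @ W_up)
```

Each pixel's RGB values would then be drawn from a conditional density parameterized by its hidden state; the nested loop is exactly why sampling is so slow.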
This works quite well in some situations!
What do you think are the drawbacks to this approach for image generation?
Advancements in recurrent image generation:
PixelCNN - exchange recurrence for CNN-style windowing; a little faster
ImageGPT - exchange recurrence for masked self-attention; a little faster with really large training sets
These do a great job when we can compute them in finite time!
Painfully slow!!!
ImageGPT has only been shown to work for up to 64 x 64 input images
Literal days of training time to learn joint densities for small-ish data sets
Can’t be used generally to create a dictionary of all images!
Advances in autoregressive generation (like DALL-E) required leveraging other stuff first!
That’s what we’ll start talking about next time